This notebook is the second delivery for the Data Valorization course at Université Côte d'Azur, Nice (2023). You can watch the explanation of the project's aim and the data cleaning here.
Business goal: Analyze how social features (race, gender) and education affect STEM salaries.
Technical goal: Perform regression to predict total yearly income
We will perform the regression with the following models, looking to obtain the lowest RMSE:
Linear Regression
SVR
Regression Tree
Random Forest
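Throughout the notebook, RMSE is the square root of the mean squared difference between actual and predicted values. A minimal sketch in R, with toy vectors standing in for real predictions:

```r
# Toy illustration of the RMSE metric used below (hypothetical values)
actual    <- c(100000, 150000, 200000)
predicted <- c(110000, 140000, 210000)
rmse <- sqrt(mean((actual - predicted)^2))
rmse # 10000: each prediction here is off by exactly 10,000
```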
options(warn=-1) # We want to generate a nice-looking notebook without warnings
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(plotly) #interactive visualisation
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(lattice) #data exploration (correlations)
library(dplyr)
library(countrycode) # For adding continent variable
library(corrplot)
## corrplot 0.92 loaded
library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(caTools)
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
##
## The following object is masked from 'package:base':
##
## Recall
library(e1071)
The salary dataset we use is scraped from Levels.fyi, a US-based online service for comparing compensation across tech companies. It contains 62,642 samples and 29 variables, predominantly categorical. We import the cleaned dataset from the first delivery, where we performed the preprocessing and one-hot encoding:
df_onehot <- read.csv('data/clean_salary_data.csv')
df_onehot <- na.omit(df_onehot) # Ensure we don't have NA values
We will feed our regression models with the one-hot encoded variables:
df_onehot
dim(df_onehot)
## [1] 21589 37
We define our output variable Y, totalyearlycompensation: how much a person earns during the whole year:
Y <- df_onehot[c("totalyearlycompensation")]
df_onehot <- subset(df_onehot, select = -c(totalyearlycompensation))
We split the data into training and testing sets:
set.seed(123)
trainIndex <- sample(1:nrow(df_onehot), 0.7 * nrow(df_onehot))
Ytrain <- Y[trainIndex,]
trainData <- df_onehot[trainIndex,]
Ytest <- Y[-trainIndex,]
testData <- df_onehot[-trainIndex,]
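A quick sanity check of the split (a sketch, assuming df_onehot is still loaded):

```r
# Roughly 70% of rows should land in the training set
nrow(trainData) / nrow(df_onehot)  # ~0.7
# Targets and features must stay aligned
length(Ytrain) == nrow(trainData)  # TRUE
length(Ytest)  == nrow(testData)   # TRUE
```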
We fit the linear regression model on the training data and predict on the testing data:
linearModel <- lm(Ytrain ~ ., data = trainData)
predictions <- predict(linearModel, newdata = testData)
We evaluate the model by plotting the residuals and calculating the RMSE:
residuals <- Ytest - predictions
plot(residuals ~ predictions, xlab = "Predictions", ylab = "Residuals",
main = "Residuals vs. Predictions")
abline(h = 0, col = "red")
rmse <- sqrt(mean(residuals^2))
R2_LR <- 1 - sum((residuals)^2) / sum((Ytest - mean(Ytest))^2)
cat("Root mean Squared Error of Linear Regression:", rmse, "\n")
## Root mean Squared Error of Linear Regression: 91757.71
cat("R squared coefficient of Linear Regression:", R2_LR, "\n")
## R squared coefficient of Linear Regression: 0.4600771
summary(Y)
## totalyearlycompensation
## Min. : 10000
## 1st Qu.: 119000
## Median : 173000
## Mean : 197425
## 3rd Qu.: 245000
## Max. :2372000
Finally, we analyze the model:
summary(linearModel)
##
## Call:
## lm(formula = Ytrain ~ ., data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -365054 -50086 -13849 30283 1834341
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23557.2 108953.7 0.216 0.828825
## ed_bachelor -50609.2 54090.8 -0.936 0.349477
## ed_master -35509.7 54082.9 -0.657 0.511462
## ed_doctor 20955.4 54248.0 0.386 0.699288
## race_asian 27628.4 93675.7 0.295 0.768046
## race_white 25232.1 93692.3 0.269 0.787696
## race_hispanic 15116.0 93619.9 0.161 0.871732
## race_black 16926.8 93786.4 0.180 0.856776
## race_two_or_more 34239.4 93773.8 0.365 0.715021
## yearsofexperience 8944.3 164.0 54.554 < 2e-16 ***
## yearsatcompany -1786.8 278.7 -6.412 1.48e-10 ***
## ed_other -60383.4 54296.5 -1.112 0.266111
## gender_female -32165.6 10861.9 -2.961 0.003068 **
## gender_male -20540.7 10733.0 -1.914 0.055667 .
## gender_other NA NA NA NA
## com_oracle 41656.1 8101.7 5.142 2.76e-07 ***
## com_other 46423.1 5973.4 7.772 8.25e-15 ***
## com_amazon 81059.7 6303.2 12.860 < 2e-16 ***
## com_apple 117429.4 7487.4 15.684 < 2e-16 ***
## com_salesforce 87244.8 8561.3 10.191 < 2e-16 ***
## com_facebook 188222.5 6889.5 27.320 < 2e-16 ***
## com_google 134166.3 6664.9 20.130 < 2e-16 ***
## com_intel 24415.6 8610.3 2.836 0.004580 **
## com_microsoft 58023.6 6496.9 8.931 < 2e-16 ***
## com_ibm NA NA NA NA
## title_pmanager 35163.6 5169.3 6.802 1.07e-11 ***
## title_sweng 15427.7 4321.0 3.570 0.000358 ***
## title_swengman 76961.9 5554.7 13.855 < 2e-16 ***
## title_datasci 9571.3 5652.9 1.693 0.090442 .
## title_other -17903.6 4644.4 -3.855 0.000116 ***
## title_hweng NA NA NA NA
## continent_namerica 94840.0 3054.0 31.054 < 2e-16 ***
## continent_samerica -47260.8 16273.1 -2.904 0.003687 **
## continent_africa 34400.3 31467.3 1.093 0.274321
## continent_oceania 48670.2 11198.2 4.346 1.39e-05 ***
## continent_asia -30414.1 3989.2 -7.624 2.60e-14 ***
## continent_europe NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 93600 on 15079 degrees of freedom
## Multiple R-squared: 0.4497, Adjusted R-squared: 0.4485
## F-statistic: 385 on 32 and 15079 DF, p-value: < 2.2e-16
Support Vector Regression (SVR) is the regression counterpart of the Support Vector Machine, a supervised learning algorithm.
We implement 6 different models:
svmModel_nu <- svm(Ytrain ~ ., data = trainData, kernel = "linear", type = "nu-regression")
summary(svmModel_nu)
##
## Call:
## svm(formula = Ytrain ~ ., data = trainData, kernel = "linear", type = "nu-regression")
##
##
## Parameters:
## SVM-Type: nu-regression
## SVM-Kernel: linear
## cost: 1
## nu: 0.5
##
## Number of Support Vectors: 7576
coef(svmModel_nu)
## (Intercept) ed_bachelor ed_master ed_doctor
## -0.097563025 -0.038727556 0.012030658 0.084853103
## race_asian race_white race_hispanic race_black
## 0.014890844 -0.007536994 -0.014142729 -0.014134388
## race_two_or_more yearsofexperience yearsatcompany ed_other
## 0.009076933 0.354937948 -0.052757125 -0.022627934
## gender_female gender_male gender_other com_oracle
## -0.011780481 0.010580065 0.005820983 -0.023680391
## com_other com_amazon com_apple com_salesforce
## -0.114341764 0.027683795 0.057420485 0.015392605
## com_facebook com_google com_intel com_microsoft
## 0.156081922 0.104911952 -0.038831414 -0.021549315
## com_ibm title_pmanager title_sweng title_swengman
## -0.067304670 0.031211252 0.005795747 0.088753323
## title_datasci title_other title_hweng continent_namerica
## -0.005109305 -0.069320591 -0.012717556 0.160482699
## continent_samerica continent_africa continent_oceania continent_asia
## -0.027190088 -0.001593356 0.009465945 -0.149976481
## continent_europe
## -0.061930428
# Test the SVM model on the testing set
svmPred_nu <- predict(svmModel_nu, testData)
# Evaluate the SVM model with RMSE
RMSE_nu <- sqrt(mean((svmPred_nu - Ytest)^2))
cat("\n Root Mean Squared Error:", RMSE_nu, "\n")
##
## Root Mean Squared Error: 93505.17
svmModel_eps <- svm(Ytrain ~ ., data = trainData, kernel = "linear", type="eps-regression")
summary(svmModel_eps)
##
## Call:
## svm(formula = Ytrain ~ ., data = trainData, kernel = "linear", type = "eps-regression")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: linear
## cost: 1
## gamma: 0.02777778
## epsilon: 0.1
##
##
## Number of Support Vectors: 12228
coef(svmModel_eps)
## (Intercept) ed_bachelor ed_master ed_doctor
## -0.139066544 -0.104691733 -0.053295831 0.055092885
## race_asian race_white race_hispanic race_black
## 0.108625940 0.081531877 0.025630440 0.013201953
## race_two_or_more yearsofexperience yearsatcompany ed_other
## 0.044532632 0.357583565 -0.049856955 -0.047258669
## gender_female gender_male gender_other com_oracle
## -0.008572223 0.007496459 0.005333605 -0.016992092
## com_other com_amazon com_apple com_salesforce
## -0.117764513 0.024842040 0.058950577 0.017955597
## com_facebook com_google com_intel com_microsoft
## 0.154430481 0.103931083 -0.047379113 -0.017624466
## com_ibm title_pmanager title_sweng title_swengman
## -0.053749870 0.028654364 0.003724878 0.097217183
## title_datasci title_other title_hweng continent_namerica
## -0.007392478 -0.070825327 -0.008278636 0.157381473
## continent_samerica continent_africa continent_oceania continent_asia
## -0.027962620 -0.004981926 0.012755926 -0.150943598
## continent_europe
## -0.056632805
svmPred_eps <- predict(svmModel_eps, testData)
RMSE_eps <- sqrt(mean((svmPred_eps - Ytest)^2))
cat("\n Root Mean Squared Error:", RMSE_eps, "\n")
##
## Root Mean Squared Error: 94267.23
svmModel_nu_ra <- svm(Ytrain ~ ., data = trainData, kernel = "radial",type = "nu-regression")
summary(svmModel_nu_ra)
##
## Call:
## svm(formula = Ytrain ~ ., data = trainData, kernel = "radial", type = "nu-regression")
##
##
## Parameters:
## SVM-Type: nu-regression
## SVM-Kernel: radial
## cost: 1
## nu: 0.5
##
## Number of Support Vectors: 8034
svmPred_nu_ra <- predict(svmModel_nu_ra, testData)
RMSE_nu_ra <- sqrt(mean((svmPred_nu_ra - Ytest)^2))
cat("\n Root Mean Squared Error:", RMSE_nu_ra, "\n")
##
## Root Mean Squared Error: 90429.82
svmModel_eps_ra <- svm(Ytrain ~ ., data = trainData, kernel = "radial", type = "eps-regression")
summary(svmModel_eps_ra)
##
## Call:
## svm(formula = Ytrain ~ ., data = trainData, kernel = "radial", type = "eps-regression")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.02777778
## epsilon: 0.1
##
##
## Number of Support Vectors: 11993
svmPred_eps_ra <- predict(svmModel_eps_ra, testData)
RMSE_eps_ra <- sqrt(mean((svmPred_eps_ra - Ytest)^2))
cat("\n Root Mean Squared Error:", RMSE_eps_ra, "\n")
##
## Root Mean Squared Error: 90905.04
svmModel_nu_po <- svm(Ytrain ~ ., data = trainData, kernel = "polynomial", type = "nu-regression")
summary(svmModel_nu_po)
##
## Call:
## svm(formula = Ytrain ~ ., data = trainData, kernel = "polynomial",
## type = "nu-regression")
##
##
## Parameters:
## SVM-Type: nu-regression
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## coef.0: 0
## nu: 0.5
##
## Number of Support Vectors: 8188
svmPred_nu_po <- predict(svmModel_nu_po, testData)
RMSE_nu_po <- sqrt(mean((svmPred_nu_po - Ytest)^2))
cat("\n Root Mean Squared Error:", RMSE_nu_po, "\n")
##
## Root Mean Squared Error: 91844.69
svmModel_eps_po <- svm(Ytrain ~ ., data = trainData, kernel = "polynomial", type = "eps-regression")
summary(svmModel_eps_po)
##
## Call:
## svm(formula = Ytrain ~ ., data = trainData, kernel = "polynomial",
## type = "eps-regression")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## gamma: 0.02777778
## coef.0: 0
## epsilon: 0.1
##
##
## Number of Support Vectors: 12146
svmPred_eps_po <- predict(svmModel_eps_po, testData)
RMSE_eps_po <- sqrt(mean((svmPred_eps_po - Ytest)^2))
cat("\n Root Mean Squared Error:", RMSE_eps_po, "\n")
##
## Root Mean Squared Error: 92446.72
This graph compares the real salary values with the predictions of the four linear- and radial-kernel models:
# Create a data frame with the actual salaries and predictions
dfgraph <- data.frame(scaled_salaries = Ytest, pred_nu = svmPred_nu, pred_eps = svmPred_eps, pred_nu_ra = svmPred_nu_ra, pred_eps_ra = svmPred_eps_ra)
# Sort the data frame by salary for plotting
dfgraph <- dfgraph[order(dfgraph$scaled_salaries), ]
# Line plot of the actual salaries and each model's predictions
plot_ly(dfgraph, x = ~scaled_salaries) %>%
add_lines(y = ~pred_nu, name = "Linear kernel + nu-regression", line = list(color = "red", width = 2)) %>%
add_lines(y = ~pred_eps, name = "Linear kernel + eps-regression", line = list(color = "green", width = 2)) %>%
add_lines(y = ~pred_nu_ra, name = "Radial kernel + nu-regression", line = list(color = "blue", width = 2)) %>%
add_lines(y = ~pred_eps_ra, name = "Radial kernel + eps-regression", line = list(color = "yellow", width = 2)) %>%
layout(xaxis = list(title = "Real Salaries"), yaxis = list(title = "Predicted Salaries"))
The model with the lowest RMSE is svmModel_nu_ra (radial kernel + nu-regression), followed closely by svmModel_eps_ra (radial kernel + eps-regression).
The residual plot of the best model is:
residuals <- Ytest - svmPred_nu_ra
plot(svmPred_nu_ra, residuals, main="Residual Plot", xlab="Predicted Values", ylab="Residuals")
abline(h = 0, col = "red")
And the Predicted vs Actual Values graph is:
lims <- range(c(Ytest, svmPred_nu_ra))
plot(Ytest, svmPred_nu_ra, main = "Predicted vs Actual Values (radial + nu regression)", xlab = "Actual Values", ylab = "Predicted Values", xlim = lims, ylim = lims)
abline(0, 1, col = "red")
To get an idea of how good or bad our results are, let's look at the summary of the output variable:
summary(Ytest)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10000 117000 174000 197249 245000 1733000
We calculate the R-squared value of our best model:
R2_nu_ra <- 1 - sum((Ytest - svmPred_nu_ra)^2) / sum((Ytest - mean(Ytest))^2)
cat("Root mean Squared Error of SVR:", RMSE_nu_ra, "\n")
## Root mean Squared Error of SVR: 90429.82
cat("R squared coefficient of SVR:", R2_nu_ra, "\n")
## R squared coefficient of SVR: 0.4755912
We continue the analysis using the regression tree and random forest methods:
full_tree <- rpart(Ytrain~., data=trainData, method='anova')
rpart.plot(full_tree)
For all trees we tried to fine-tune the parameters in order to obtain the lowest possible error while avoiding overfitting. In the end, the smallest error (RMSE) was obtained with the trees whose parameters were kept at their defaults. Below we show one of our trials as an example:
## Attempts at finding the optimal cp parameter for building the trees
## Calculated according to lab 8
library(MLmetrics)
alt_full_tree <- rpart( Ytrain~.,
data = trainData,
method='anova',
control = rpart.control(
xval = 10, # number of cross-validations
minbucket = 2, # the minimum number of observations in any terminal (leaf) node
cp = 0.0)
)
optimalCp <- alt_full_tree$cptable[which.min(alt_full_tree$cptable[, 4]), 1] # column 4 = xerror, column 1 = CP
alt_full_tree_Optimal <- prune(alt_full_tree, cp=optimalCp)
#Plot the decision tree
rpart.plot(alt_full_tree_Optimal,type=3)
y_pred_tunedtree = predict(newdata=testData, object=alt_full_tree)
rmse_tunedtree = sqrt(MSE(y_pred_tunedtree, Ytest)) # ~100k
sprintf("RMSE for the tunedtree: %s", rmse_tunedtree)
## [1] "RMSE for the tunedtree: 103622.941400866"
Moreover, we build a tree based only on the social features: race, gender, and education. Manipulating the default cp, we obtained the best result:
social_tree <- rpart( Ytrain ~ ed_master+ed_bachelor+ed_doctor+ed_other+
race_asian+race_hispanic+race_white+race_black+race_two_or_more+
gender_female+gender_male+gender_other,
data=trainData, method='anova', control = rpart.control(cp = 0.005))
rpart.plot(social_tree)
As education was the most influential of the social features, we continue by removing it:
non_ed_tree <- rpart(Ytrain~race_asian+ race_hispanic+race_white+ race_black+ race_two_or_more+
gender_female+ gender_other + gender_male,
data=trainData, method='anova', control = rpart.control(cp = 0.0005))
rpart.plot(non_ed_tree)
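The social-feature trees above are only plotted; their predictive power could be quantified with the same RMSE evaluation used for the other models (a sketch, reusing testData and Ytest from the earlier split):

```r
# Sketch: how well do social features alone predict salary?
pred_social <- predict(social_tree, newdata = testData)
pred_non_ed <- predict(non_ed_tree, newdata = testData)
sqrt(mean((Ytest - pred_social)^2)) # expected to be well above the full-tree RMSE
sqrt(mean((Ytest - pred_non_ed)^2))
```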
As an extension, we also implement a Random Forest analysis:
classifier_RF = randomForest(x = trainData, y= Ytrain, ntree=50, keep.forest=TRUE)
y_pred_RF = predict(newdata=testData, object=classifier_RF)
classifier_RT <- rpart(Ytrain~., data=trainData, method="anova" )
y_pred_RT = predict(newdata=testData, object=classifier_RT)
Comparing the chosen accuracy metric (RMSE) between regression tree and random forest:
rmse_RF = sqrt(MSE(y_pred_RF, Ytest)) # ~90k
rmse_RT = sqrt(MSE(y_pred_RT, Ytest)) # ~100k
R2_RF <- 1 - sum((Ytest - y_pred_RF)^2) / sum((Ytest - mean(Ytest))^2)
R2_RT <- 1 - sum((Ytest - y_pred_RT)^2) / sum((Ytest - mean(Ytest))^2)
Show all the metrics:
sprintf("RMSE for the regression tree method: %s", rmse_RT)
## [1] "RMSE for the regression tree method: 101918.807393375"
sprintf("R squared score for the regression tree method: %s", R2_RT)
## [1] "R squared score for the regression tree method: 0.333875655465425"
sprintf("RMSE for the random forest method: %s", rmse_RF)
## [1] "RMSE for the random forest method: 89600.6147020292"
sprintf("R squared score for the random forest method: %s", R2_RF)
## [1] "R squared score for the random forest method: 0.485164343137025"
As aspiring data scientists, we wanted to know what our future salaries could look like. If properly implemented and tuned, this tool could help prospective STEM professionals obtain a reference salary to ask for in job interviews, taking into account all the variables discussed, including their education and professional careers. It can also help in choosing a specialization and a place to work to obtain the best possible salary.
Moreover, for anyone who is part of a minority, it is useful to be aware of the effect of prejudice on prospective salaries, in order to ask for fair compensation.
We begin by creating a data frame with our characteristics:
# Ewa, Marina, Tymoteusz, Lucía
ed_bachelor <- c(0,0,0,0)
ed_doctor <- c(0,0,0,0)
ed_other <- c(0,0,0,0)
ed_master <- c(1,1,1,1)
race_asian <- c(0,0,0,0)
race_hispanic <- c(0,0,0,0)
race_black <- c(0,0,0,0)
race_two_or_more <- c(0,0,0,0)
race_white <- c(1,1,1,1)
yearsofexperience <- c(1,0,2,3)
yearsatcompany <- c(0,0,0,0)
gender_female <- c(1,1,0,1)
gender_male <- c(0,0,1,0)
gender_other <- c(0,0,0,0)
com_oracle <- c(0,0,0,0)
com_other <- c(1,0,1,1)
com_amazon <- c(0,0,0,0)
com_apple <- c(0,0,0,0)
com_salesforce <- c(0,0,0,0)
com_facebook <- c(0,0,0,0)
com_google <- c(0,0,0,0)
com_intel <- c(0,0,0,0)
com_microsoft <- c(0,0,0,0)
com_ibm <- c(0,0,0,0)
title_pmanager <- c(0,0,0,0)
title_sweng <- c(0,0,1,0)
title_swengman <- c(0,0,0,0)
title_datasci <- c(1,1,0,1)
title_other <- c(0,0,0,0)
title_hweng <- c(0,0,0,0)
continent_namerica <- c(0,0,0,0)
continent_samerica <- c(0,0,0,0)
continent_africa <- c(0,0,0,0)
continent_oceania <- c(0,0,0,0)
continent_asia <- c(0,0,0,0)
continent_europe <- c(1,1,1,1)
#We include the characteristics in a dataframe
teamcharacteristics <- data.frame(ed_bachelor,ed_master,ed_other,ed_doctor,race_asian,race_white,race_hispanic,race_black,race_two_or_more,yearsofexperience,yearsatcompany,gender_female,gender_other,com_oracle,com_other,com_amazon,com_apple,com_salesforce,com_facebook,com_google,com_intel,com_microsoft,com_ibm,title_pmanager,title_sweng,title_swengman,title_datasci,title_other,title_hweng,continent_namerica,continent_samerica,continent_africa,continent_oceania,continent_asia,continent_europe,gender_male)
teamcharacteristics
names<- c('Marina','Ewa','Tymo','Lucia')
y_team_salaries = predict(newdata=teamcharacteristics, object=classifier_RF)
teampredictions <- data.frame(names,y_team_salaries)
teampredictions
As a summary, we gather all the metrics used to evaluate the models and compare their performance:
| Model | RMSE | R-squared |
|---|---|---|
| Linear Regression | 91,757.71 | 0.4601 |
| SVR (radial kernel + nu-regression) | 90,429.82 | 0.4756 |
| Regression Tree | 101,918.81 | 0.3339 |
| Random Forest | 89,600.61 | 0.4852 |
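The same comparison could be assembled programmatically from the metrics computed in the previous sections (a sketch, assuming those variables are still in scope):

```r
# Gather the RMSE and R-squared values computed earlier into one table
results <- data.frame(
  Model = c("Linear Regression", "SVR (radial + nu)", "Regression Tree", "Random Forest"),
  RMSE  = c(rmse, RMSE_nu_ra, rmse_RT, rmse_RF),
  R2    = c(R2_LR, R2_nu_ra, R2_RT, R2_RF)
)
results[order(results$RMSE), ] # best model (lowest RMSE) first
```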
The average salary across our dataset is 197,425 USD. The Linear Regression, SVR and Regression Tree models all have an RMSE between roughly 90k and 102k, which corresponds to an average error of about 46-52% of the mean salary. Due to such high inaccuracy, these methods are not good predictors for this problem.
The Random Forest model performs best, with an RMSE of around 89.6k (still about 45% of the mean salary) and the highest R-squared (0.485). While the error remains substantial, it is clearly the strongest of the four models, and with further tuning we see real value for users and potential business value in this approach.